TextPipe: Online Help
    EasyPatterns Reference
 


 

 

Note: For the latest copy of this documentation, please see the complete EasyPatterns reference (online)

Literal Text, Keywords and Expressions

Notation: "..." (without the quotes) is any appropriate text, keyword or expression.

The entire pattern is a string, so it is enclosed in double-quotes. With a few exceptions, the quotation marks will not be shown in this section.

the simplest pattern is just literal text
... e.g. "abc"
EasyPattern keywords are enclosed in brackets
[...] e.g. "[letter]"
literal text & keywords can be combined
... [...] e.g. "abc[digit]" matches "abc1", "abc2", etc.
[...] ... e.g. "[digit]abc" matches "1abc", "2abc", etc.
... [...] ... e.g. "[digit]abc[digit]" matches "1abc1", "1abc2", "2abc1", etc.
multiple keywords can appear together
[...][...] e.g. [letter][digit] matches "a1", "b1", etc.
[..., ...] e.g. [letter, digit] instead of [letter][digit]
[... ...] e.g. [letter digit] instead of [letter][digit]
  • Any "][" pair in a pattern (other than within a quoted string) can be replaced with a comma (followed by an optional space). Any comma (and optional space) in a pattern can be replaced by "][". The meaning is the same.
  • Any "][" pair or comma in a pattern (other than within a quoted string) can be replaced with a space (actually, one or more spaces). Likewise, any space in a pattern can be replaced by a comma or "][". The meaning is the same.
literal text can be included inside a bracketed expression using single quotes
['literal'] e.g. "['abc']" instead of "abc"
[... 'literal'] e.g. "[digit, 'abc']" instead of "[digit]abc"
['literal' ...] e.g. "['abc', digit]" instead of "abc[digit]"
[... 'literal' ...] e.g. "[digit, 'abc', digit]" instead of "[digit]abc[digit]"
  • There is usually no difference in meaning between including literal text within a bracketed expression (in single quotes) and leaving literal text outside the brackets. The choice is a matter of individual preference. One exception: when using [not], only the single-quoted literal will work, e.g. [not '-'].

Essential Keywords

The most important keywords represent character sets, that is, any of a set of related characters.

anything, letters, digits, etc.
[character], [char], [chars], [characters] All 256 chars (every character including NULL)
[letter], [letters] includes é and ü, common in certain European languages.
[digit], [digits] decimal digits 0-9
[hexdigit] hexadecimal digits 0-9, a-f and A-F
[punctuation] printing characters, excluding letters and digits, includes !?.,:; " ' ' / - () {} -
[symbol], [symbols] ~@#$%^&*
  • EasyPattern's [character] or [char] will match any character including return. If you want any character except a return (or formfeed), use [paragraphChar]; that is, any character that could appear in a paragraph. Details below.
  • EasyPattern distinguishes punctuation from symbols; the sets do not overlap. For broader combinations, see [printableChar] and [typewriterChar]. For narrower focus, see [sentencePunctuation], [anyQuote], [anyBracket] and [anyDash].
  • Note that « and » are considered punctuation.
special letters
[upper], [uppercase], [uppercaseLetter] uppercase letters
[lower], [lowercase], [lowercaseLetter] lowercase letters
Reserved punctuation
[leftBracket] [
[rightBracket] ]
[leftParen], [leftParenthesis] (
[rightParen], [rightParenthesis] )
[comma] ,
[singleQuote] '
[doubleQuote], [quote] " (i.e. standard ASCII "straight" quotation mark)
[backwardSingleQuote] `  
  • EasyPattern gives special meaning to certain punctuation marks; these keywords can be used to represent the literal character.
  • Square brackets are "reserved" by EasyPattern. The other punctuation marks listed here are literal when they appear outside of brackets.
  • A literal comma and left & right brackets & parentheses may appear inside single quotes. The keywords are provided to make patterns easier to read.

Character Sets

There are many ways to create your own character sets to match exactly the characters you require.

combining character sets with "or"
[... or ...] e.g. [letter or digit], ['a' or 'b']
  • Most EasyPattern keywords refer to a set of characters from which one will match. The first use of "or" is to make a larger set -- though again, only one of the larger set will match. (Any quantity can be specified using repetition keywords, but it is still applying a quantity to a single character not to multiple characters)
  • When sets are combined with "or", parentheses are optional. (In technical terms, the character set use of "or" has very high precedence; see below)
     "[letter, letter or digit, letter]" -- matches "aaa", "xyz", "h4q", "b7f" etc.

The commas are optional here too; the "or" implicitly groups any character set keywords:
 "[letter letter or digit letter]" -- same as above

Of course, it doesn't hurt to add parentheses even though they are not required.
 "[letter (letter or digit) letter]" -- same as above

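For readers who think in regex, the implicit grouping of "or" for character sets can be sketched with Python's re module. This is only an approximation: the ASCII-only classes are an assumption made for the sketch (EasyPattern's [letter] also includes accented letters), and the variable names are invented for illustration.

```python
import re

# Assumption: ASCII-only stand-ins for [letter] and [digit].
LETTER = "A-Za-z"
DIGIT = "0-9"

# "[letter, letter or digit, letter]": the middle item is ONE character
# drawn from the merged letter-or-digit set.
pattern = re.compile(f"[{LETTER}][{LETTER}{DIGIT}][{LETTER}]")

samples = ["aaa", "xyz", "h4q", "b7f", "4ab", "a44"]
results = {s: bool(pattern.fullmatch(s)) for s in samples}
print(results)
```

Because "or" merges the two sets into one, the middle position still consumes exactly one character, so "4ab" and "a44" are rejected.
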
negation
[not ...], [non ...], [anyExcept ...] e.g. [oneOrMore non letter]
  • Instead of specifying all the characters that could occur in a match, it is often convenient to specify characters that could not occur. EasyPattern has keywords for [quotedString] and [HTMLTag], but if it didn't, they would be easy to define:
     ['<', oneOrMore not '>', '>'] « same as [HTMLTag]
     [quote, oneOrMore not quote, quote] « a simple definition for [quotedString]
  • Negation can only be applied to a single character, or a character set from which one will match. For example:
     [not letter] « fine
     [not letter or digit] « fine: [letter or digit] is a set from which one will match
     [not word] « ERROR: "word" matches multiple characters
     [not 'a'] « fine (a single character)
     [not 'whatever'] « ERROR
     
    [not lineChar or letter] « CAREFUL! [lineChar] is defined as [not linefeed OR verticalTab OR formfeed OR return]. You cannot combine negated and non-negated character sets, so this pattern is equivalent to [ (not lineChar) or letter ], instead of [ not (lineChar or letter) ]
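
In regex terms, negation of a set corresponds roughly to a negated character class. The sketch below uses Python's re module and assumes ASCII-only letters and digits; it illustrates the idea and is not EasyPattern's internal definition of these keywords.

```python
import re

# ['<', oneOrMore not '>', '>']
html_tag = re.compile(r"<[^>]+>")
# [quote, oneOrMore not quote, quote]
quoted_string = re.compile(r'"[^"]+"')
# [not letter or digit]: one character outside the merged set
not_letter_or_digit = re.compile(r"[^A-Za-z0-9]")

text = '<p class="x">Hello <b>world</b></p>'
tags = html_tag.findall(text)
strings = quoted_string.findall(text)
others = not_letter_or_digit.findall("a-1 b")
print(tags, strings, others)
```

A negated class can only stand for one character at a time, which is the regex counterpart of the rule that [not 'whatever'] is an error.
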
custom sets
[<...>] e.g. [<aeiou>], [<135>], [<!@#$%^&*>]
  • Keywords such as [letter] and [digit] are character sets defined internally to EasyPattern; the angle bracket notation lets you define your own character sets. In both cases, EasyPattern matches any single character in that set.
  • For single characters, [or] and a set are interchangeable, e.g. [<aei>] and ['a' or 'e' or 'i'] have the same meaning.
  • User defined sets, single character literals and EasyPattern keywords can be combined with [or]:
     [<aeiou> or <123> or '7' or symbol]
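
A custom set behaves much like a regex character class. The following Python sketch shows the equivalence between a set and single-character alternatives; custom_set is a hypothetical helper written for this illustration, not part of EasyPattern or TextPipe.

```python
import re

def custom_set(chars: str) -> str:
    """Build a regex character class from literal characters (hypothetical helper)."""
    return "[" + "".join(re.escape(c) for c in chars) + "]"

vowel_set = re.compile(custom_set("aei"))   # like [<aei>]
vowel_alt = re.compile("a|e|i")             # like ['a' or 'e' or 'i']

word = "bailie"
same = vowel_set.findall(word) == vowel_alt.findall(word)

symbol_set = re.compile(custom_set("!@#$%^&*"))  # like [<!@#$%^&*>]
found = symbol_set.findall("50% off & more!")
print(same, found)
```

Escaping each character keeps "^" and similar marks literal inside the class, in the same way that the angle brackets treat everything between them as plain characters.
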

Alternatives

alternative patterns with "or"
[... or ...] e.g. ['Player' or 'EasyPattern']
  • When "or" is used to specify alternatives as part of a larger pattern, grouping parentheses are required, e.g.
     "[space, 'Player' or 'EasyPattern', space]" -- may not mean what you think!
     "[space]Player[or]EasyPattern[space]" -- may not mean what you think!
     "[(space, 'Player') or ('EasyPattern', space)]" -- that's what they mean
     "[space ('Player' or 'EasyPattern') space]" -- this might be what you wanted

Remember: as noted in the section on expressions, commas are allowed between items to make patterns easier to read; they do not affect what the pattern means.

  • If you leave out the parentheses, EasyPattern will treat everything to the left of the "or" as one implicit group and everything to the right of the "or" as a separate group. Note that visual grouping with brackets or commas is not enough; you must use parentheses. For example, all of the following will be interpreted as "[(digit, 'this') or ('that')]":
     "[digit, 'this' or 'that']" « careful; the commas may mislead
     "[digit]['this' or 'that']" « careful; the brackets may mislead
     "[digit 'this' or 'that']" « the grouping isn't clear; parentheses would help
     "[digit]this[or]that" « the grouping isn't clear; parentheses would help

As noted in the previous section, parentheses are not required when "or" is used to combine character sets.
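
The effect of the missing parentheses can be reproduced with ordinary regex alternation, which splits a pattern the same way. This Python sketch assumes that \d is an acceptable stand-in for [digit]; the sample strings are invented.

```python
import re

# No parentheses: "[digit, 'this' or 'that']" is read as
# [(digit, 'this') or ('that')].
ungrouped = re.compile(r"\dthis|that")
# Explicit parentheses: [digit ('this' or 'that')].
grouped = re.compile(r"\d(?:this|that)")

samples = ["1this", "that", "1that"]
ungrouped_hits = [s for s in samples if ungrouped.fullmatch(s)]
grouped_hits = [s for s in samples if grouped.fullmatch(s)]
print(ungrouped_hits)
print(grouped_hits)
```

Only the parenthesized form requires the digit in front of both alternatives.
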

"or" as set vs. "or" as alternative

In many cases, you don't have to worry that there are two different uses for "or"; both generally make sense in context. However, there are 2 reasons for learning the differences:

  • or as set doesn't require parentheses; the grouping is implied
  • or as set can be part of a "not" expression since it still represents one character

Repetition, Quantity

Notation: "..." is any appropriate keyword or expression, # is a number (one or more digits; the maximum varies with context).

repetition                               examples                                               will match...
[optional ...], [zeroOrOne ...]          [digit, optional letter], [digit, zeroOrOne letter]    2, 2a
[0+ ...], [zeroOrMore ...]               [digit, zeroOrMore letters]                            2, 2a, 2aa, 2aaa, 2aaaa...
[1+ ...], [oneOrMore ...]                [digit, oneOrMore letters]                             2a, 2aa, 2aaa, 2aaaa...
[2+ ...], [many ...], [twoOrMore ...]    [digit, many letters], [digit, twoOrMore letters]      2aa, 2aaa, 2aaaa...
[#+ ...]                                 [digit, 5+ letters]                                    2aaaaa, 2aaaaaa...
  • A space is not allowed, e.g. "one or more" will not be recognized
  • The words are all special cases, e.g. "threeOrMore" will not work (use "3+")
specific quantity, quantity range (where # is a number)    examples            will match...
[# ...]                                                    [5 letters]         aaaaa, bbbbb
[# to # ...]                                               [3 to 5 letters]    aaa, aaaa, aaaaa
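
For comparison with regex, the prefix quantities above map onto suffix quantifiers. The Python sketch below is a hypothetical translation table written for this illustration; the function names and the ASCII-only LETTER class are assumptions, and greedy quantifiers are used because the whole string is tested at once.

```python
import re

# Hypothetical mapping from EasyPattern quantity words to regex quantifiers.
WORDS = {
    "optional": "?", "zeroOrOne": "?",
    "zeroOrMore": "*", "oneOrMore": "+",
    "many": "{2,}", "twoOrMore": "{2,}",
}

def quantifier(quantity: str) -> str:
    """Translate 'oneOrMore', '5+', '3 to 5' or '5' into a regex quantifier."""
    if quantity in WORDS:
        return WORDS[quantity]
    if quantity.endswith("+"):                    # "#+"
        return "{%d,}" % int(quantity[:-1])
    if " to " in quantity:                        # "# to #"
        low, high = quantity.split(" to ")
        return "{%d,%d}" % (int(low), int(high))
    return "{%d}" % int(quantity)                 # "#"

LETTER = "[A-Za-z]"  # assumption: ASCII letters only

def matches(regex: str, text: str) -> bool:
    return re.fullmatch(regex, text) is not None

digit_optional_letter = r"\d" + LETTER + quantifier("optional")
digit_many_letters = r"\d" + LETTER + quantifier("many")
three_to_five_letters = LETTER + quantifier("3 to 5")
print(digit_optional_letter, digit_many_letters, three_to_five_letters)
```

Anything outside the listed words falls through to int() and raises ValueError, which loosely mirrors the rule that "threeOrMore" is not recognized.
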

 

shortest vs. longest match
[shortest ... ...] match the lowest possible number of repetitions (default)
[longest ... ...] match the highest possible number of repetitions
  • EasyPattern defaults to the SHORTEST match so the "shortest" keyword is optional.
  • When the repetition or count includes a range of values to match, EasyPattern has the choice of matching the "shortest" sequence of characters that fits the pattern, or the "longest" that fits the pattern. For example:
[shortest zeroOrOne ...] 0 or 1 will try to match zero occurrences
[shortest zeroOrMore ...] 0+ will try to match zero occurrences
[shortest oneOrMore ...] 1+ will try to match one occurrence
[shortest twoOrMore ...] 2+ will try to match two occurrences
  • In these cases, EasyPattern will only match more than the minimum if required to complete additional parts of the pattern, e.g. given "abc123" and the pattern "[shortest oneOrMore letter, digit]", EasyPattern will match "abc1", i.e. all 3 letters. However, given the same string and the pattern "[shortest oneOrMore letter]", EasyPattern will just match "a", the first letter. Given the same string and "[longest oneOrMore letter]", EasyPattern will match "abc". Note that EasyPattern always starts with the first character that matches that pattern, e.g. despite "c1" being shorter than "abc1", EasyPattern matches the latter.
  • Shortest/longest can be confusing.
  • Shortest can be quite slow; where possible, use "not" instead, e.g. ['<', oneOrMore not '>', '>'].
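
The "abc123" cases above correspond closely to lazy and greedy quantifiers in regex. This Python sketch assumes that "shortest" maps to a lazy quantifier (+?) and "longest" to a greedy one (+); the variable names are invented for the illustration.

```python
import re

text = "abc123"

# [shortest oneOrMore letter, digit]: the digit forces all three letters.
shortest_then_digit = re.search(r"[A-Za-z]+?\d", text).group()
# [shortest oneOrMore letter]: nothing follows, so one letter is enough.
shortest_alone = re.search(r"[A-Za-z]+?", text).group()
# [longest oneOrMore letter]
longest_alone = re.search(r"[A-Za-z]+", text).group()

# A negated set often gives the same result as a lazy match.
html = "<b>bold</b>"
lazy_tags = re.findall(r"<.+?>", html)
negated_tags = re.findall(r"<[^>]+>", html)

print(shortest_then_digit, shortest_alone, longest_alone)
print(lazy_tags == negated_tags)
```

The match still begins at the first possible position, which is why the first result is "abc1" rather than "c1".
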

literals, groups

All of the repetition & quantity keywords can be applied to literals and groups as well as to individual keywords, e.g.
 [oneOrMore 'ab'] « matches "ab", "abab", "ababab" etc.
 [oneOrMore letter or digit] « matches "aaa", "456", "a45bbb" etc.
 [oneOrMore not letter or digit] « matches punctuation, symbols, whitespace etc.
 [oneOrMore ('alpha' or 'omega')] « matches "alphaalpha", "alphaomega" etc.
 [oneOrMore (letter, digit)] « matches "r2", "r2d2", "r2d2f7b2c4" etc.
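
As a rough regex counterpart, repetition applied to a literal or group corresponds to a quantifier on a non-capturing group. The Python sketch below assumes ASCII-only letters and digits; full is a small helper invented for the example.

```python
import re

def full(regex: str, text: str) -> bool:
    """True when the whole text matches the regex."""
    return re.fullmatch(regex, text) is not None

literal_repeat = r"(?:ab)+"           # [oneOrMore 'ab']
set_repeat = r"[A-Za-z0-9]+"          # [oneOrMore letter or digit]
negated_repeat = r"[^A-Za-z0-9]+"     # [oneOrMore not letter or digit]
alt_repeat = r"(?:alpha|omega)+"      # [oneOrMore ('alpha' or 'omega')]
group_repeat = r"(?:[A-Za-z]\d)+"     # [oneOrMore (letter, digit)]

print(full(literal_repeat, "ababab"), full(group_repeat, "r2d2f7b2c4"))
```
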

Grouping and Variable Assignment

[(...)] A non-capturing group
[capture(...)] Matching text is captured into 'group#' in the pattern, and into $# in the replacement.

# can range from 1 to 26

[group#] # can range from 1 to 26 e.g.

[(letter)1, group1] « matches "ee", "bb", "cc" etc

[mustBeginWith(...) ...], [mustNotBeginWith(...) ...] When a match is found, it must be/must not be preceded by what is in the brackets. The bracket contents are NOT included in the actual match. The bracket contents are limited to fixed length strings - so no '3+' etc are allowed. This must be the first part of your pattern.

[mustBeginWith( 'hello' or 'goodbye' ) 'fred']

[... mustEndWith(...)], [... mustNotEndWith(...)] When a match is found, it must be/must not be followed by what is in the brackets. The bracket contents are NOT included in the actual match. The bracket contents are limited to fixed length strings - so no '3+' etc are allowed. This must be the last part of your pattern.

['fred' mustEndWith( 'erick' or 'dy' ) ]

  • Parentheses without trailing digits form groups, e.g. to apply a quantity or repetition.
  • Commas are optional, e.g. the following patterns are equivalent:
     [( letter )1 ( digit )2]
     [( letter )1, ( digit )2]
  • By adding a number immediately after ")", you are in effect assigning the contents of the group to a variable; the variable can be referred to using "group#", described above. Note: if you need more than 26 variables, please send us an example to illustrate why!
  • Parentheses must match, i.e. ")" always ends the most recent "(", independent of number.
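
Readers familiar with regex may recognize these features as capture groups, backreferences and lookaround. The Python sketch below is an approximation under that assumption. Python's own lookbehind needs a single fixed width, so the two alternatives of different length are written as two separate lookbehinds; that is a Python detail, not an EasyPattern rule.

```python
import re

# [(letter)1, group1]: a letter followed by the same letter again.
doubled = re.compile(r"([A-Za-z])\1")
pairs = [m.group() for m in doubled.finditer("bookkeeper")]

# [( letter )1 ( digit )2] with the captures swapped in the replacement.
swapped = re.sub(r"([A-Za-z])(\d)", r"\2\1", "r2 d2")

# [mustBeginWith( 'hello' or 'goodbye' ) 'fred']
begins = re.compile(r"(?:(?<=hello)|(?<=goodbye))fred")
# ['fred' mustEndWith( 'erick' or 'dy' ) ]
ends = re.compile(r"fred(?=erick|dy)")

print(pairs, swapped)
print(begins.search("goodbyefred").group(), ends.search("frederick").group())
```

In both lookaround cases the reported match is just "fred"; the surrounding text is checked but not included.
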

Comments

EasyPattern allows comments to be included in multi-line patterns using the character ';' or '#' to mark the start of a comment, extending until the end of the line e.g.

[ 3 space ;look for 3 spaces
  'hello'    #then the keyword we want
]
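
Python's re module has a comparable feature in its VERBOSE flag, shown here as a loose analogy. Note the assumptions: Python only recognizes '#' (not ';') as the comment character, and a literal space must be written inside a character class because verbose mode ignores bare whitespace.

```python
import re

pattern = re.compile(
    r"""
    [ ]{3}   # look for 3 spaces
    hello    # then the keyword we want
    """,
    re.VERBOSE,
)

hit = pattern.search("say   hello")
miss = pattern.search("say hello")
print(repr(hit.group()), miss)
```
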

Whitespace

whitespace (including items covered above)
[space], [spaces] ASCII 32
[nonbreakingSpace] ASCII 202
[whitespace] [space OR tab OR cr OR lf OR verticalTab OR nonbreakingSpace]
[tab] ASCII 9, \t
[return], [cr] ASCII 13, \r
[linefeed], [lf] ASCII 10, \n
[verticalTab] ASCII 11
[formfeed] ASCII 12, \f
[null] ASCII 0
[CRLF] [return, linefeed]
[newline] [(return, linefeed) or return or linefeed]
[DOSNewline] [return, linefeed]
[UNIXNewline] [linefeed]
[MacNewline] [return]
  • "not" cannot be applied to [CRLF], [newline] or [DOSNewline] since they either are or may be a character sequence rather than just a single character.
  • A space character can usually be typed directly into a pattern ([ ' ' ]) but using the keyword may make the pattern easier to understand (and modify later)
whitespace combinations
[horizontalWhitespace], [hSpace] [space or nonbreakingSpace or tab]
[verticalWhitespace], [vSpace] [return or linefeed or formfeed or verticalTab]
words, columns, lines & paragraphs
[wordDelimiter] [space OR tab OR linefeed OR verticalTab OR formfeed OR return]
[wordChar] [not wordDelimiter]
[word] [1+ wordChar]
   
[columnDelimiter] [tab OR linefeed OR formfeed OR return]
[columnChar] [not columnDelimiter]
[column] [1+ columnChar]    Note: Use [0+ columnChar] instead if the column could be blank
   
[lineDelimiter] [linefeed OR verticalTab OR formfeed OR return]
[lineChar] [not lineDelimiter]
[line] [1+ lineChar]    Note: Use [0+ lineChar] instead if the line could be blank
   
[paragraphDelimiter] [formfeed OR return]
[paragraphChar] [not paragraphDelimiter]
[paragraph] [1+ paragraphChar]
  • The above delimiters are characters not positions; they will "consume" the character that they match. In contrast, [TextStart] and [TextEnd] (below) are positions.
  • The above objects (word, column, line, paragraph) do not include delimiters. So, to match multiple objects, you need to include the delimiters, e.g.
     [2+ word] -- won't match anything
     [2+ (word, optional wordDelimiter)] -- correct
  • The definition for word is based strictly on whitespace so it will include punctuation, matching text such as "$27.52" and "fancy+name". Although in many cases it would be nice to exclude trailing punctuation, that pattern would fail for text like "S.M.U.". When EP's definition of a word isn't appropriate for your text, simply use the custom pattern that fits. For example, [1+ wordChar, letter or digit or symbol] would ensure that the last char is not punctuation.
  • Word, column, line & paragraph require one or more characters. If a line might be empty, use: [0+ lineChar] instead of [line].
  • Because the definitions for word, column, line & paragraph look for anything except the appropriate delimiter (rather than the leading delimiter, a series of anything else, and the trailing delimiter), they can be used to get the rest of a word, column, line & paragraph when the starting point is already in the middle. See the example scripts for details.
  • These definitions allow control characters (except the specific whitespace used as delimiters) to appear in words, columns, lines & paragraphs.
  • A column may contain the verticalTab character. (It's used by FileMaker to indicate line breaks within a field.)
  • Word, column, line & paragraph consist of multiple characters so patterns like "[not word]" don't make sense.
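
The delimiter-based definitions above translate fairly directly into negated character classes. The Python sketch below copies the delimiter lists from the table and is otherwise an assumption-laden illustration; the constant names are invented.

```python
import re

# Delimiters copied from the definitions above (\x0b = verticalTab, \x0c = formfeed).
WORD = r"[^ \t\n\x0b\x0c\r]+"      # [word]   = [1+ wordChar]
COLUMN = r"[^\t\n\x0c\r]+"         # [column] = [1+ columnChar]
LINE = r"[^\n\x0b\x0c\r]+"         # [line]   = [1+ lineChar]
WORD_DELIMITER = r"[ \t\n\x0b\x0c\r]"

words = re.findall(WORD, "Paid $27.52 to S.M.U. today")
columns = re.findall(COLUMN, "id\tfull name\tcity\n")

# [2+ (word, optional wordDelimiter)]
several_words = re.compile(rf"(?:{WORD}{WORD_DELIMITER}?){{2,}}")
whole = several_words.fullmatch("one two three")

# Starting in the middle of a line returns the rest of that line.
rest_of_line = re.compile(LINE).match("first line\nsecond", 6).group()

print(words, columns, rest_of_line)
```

Since a column only excludes tab and the line-ending characters, "full name" stays together as one column even though it contains a space.
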
positions
[textStart] matches at start of entire text
[textEnd] matches at end of the entire text or before newline at end
[lineStart] matches the start of a line
[lineEnd] matches the end of a line
[wordBoundary] matches at a word boundary
[notWordBoundary] matches when not at a word boundary
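
Position keywords consume no characters, much like regex anchors. The Python mapping below is an assumption chosen for illustration: \A for [textStart], $ without the MULTILINE flag for [textEnd] (it also matches just before a final newline), ^ and $ with MULTILINE for [lineStart] and [lineEnd], and \b and \B for the word boundary keywords.

```python
import re

text = "alpha beta\ngamma\n"

text_start = re.search(r"\Aalpha", text)
text_end = re.search(r"gamma$", text)        # end of text, before the final newline
not_text_end = re.search(r"beta$", text)     # "beta" only ends a line, not the text

line_starts = re.findall(r"^\w+", text, re.MULTILINE)
line_ends = re.findall(r"\w+$", text, re.MULTILINE)

sample = "beta alphabet"
at_boundary = [m.start() for m in re.finditer(r"\bbet", sample)]
not_at_boundary = [m.start() for m in re.finditer(r"\Bbet", sample)]

print(line_starts, line_ends, at_boundary, not_at_boundary)
```
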

More Keywords

combinations
[controlChar] characters 0-31, 127 (careful: includes most whitespace)
[gremlin] characters 0-31. The definition for [gremlin] is more cautious than in some products.
[printableChar] [letter or digit or punctuation or symbol] (anything that prints ink on paper)
[typewriterChar] [printableChar or space or tab or return] (excludes linefeed, vertical tab & formfeed)

 

punctuation subsets (these items are included in [punctuation])
[sentencePunctuation] .,;:!?¿¡
[anyBracket], [anyBrackets] left/right paren/bracket/brace (i.e. "bracket" in the broad sense of the term)
[anyQuote] [doubleQuote OR singleQuote OR backwardSingleQuote]
[dash], [hyphen] -    The two keywords are used interchangeably; we have adopted the common notion that these terms refer to the same character.
[period] .
[caret] ^
[pound], [hash] #
[slash] /
[backslash] \
[colon] :
[percent] %
[star], [asterisk] *
[ampersand] &

 

real-world patterns
[HTMLTag] <[1+ not '>']>
[HTMLStartTag] <[not '/', 0+ not '>']> (i.e. any tag except an end tag)
[HTMLEndTag] </[1+ not '>']>
[QuotedString] [quote, 1+ ((backslash, quote) or not quote), quote]
[SocialSecurityNumber] [3 digits, dash, 2 digits, dash, 4 digits]
[PhoneNumber] Matches a US-style (xxx) xxx-xxxx number with a variety of punctuation marks. The matching text is captured into 3 successive $variables
[EmailAddress] Matches email addresses. The name and domain parts are captured into 2 successive $variables
[IPAddress] Matches numeric IP addresses. The matching text is captured into 4 successive $variables
[CreditCard] Matches credit card numbers with a variety of punctuation marks. The matching text is captured into 4 successive $variables
[Hyperlink] Matches an ftp, http, https, telnet, gopher or nntp internet url. The matching text is captured into 3 successive $variables
[DuplicateWord] Matches a repeated word. The matching text is captured into 2 successive $variables
[PageNumber] Matches a page number of the following forms:
Page dd
Page No dd
Page No. dd
Page Num. dd
Pg Num dd
Page Number dd.

The matching text is captured into 3 successive $variables (Page, Number, #)
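
For the patterns whose definitions are spelled out above, a literal regex transcription is possible. The Python sketch below covers only those four definitions and assumes \d for [digit]; the patterns described only in words ([PhoneNumber], [EmailAddress] and so on) are not reproduced because their exact definitions are not given here.

```python
import re

# [3 digits, dash, 2 digits, dash, 4 digits]
SSN = re.compile(r"\d{3}-\d{2}-\d{4}")
# <[not '/', 0+ not '>']>
HTML_START_TAG = re.compile(r"<[^/][^>]*>")
# </[1+ not '>']>
HTML_END_TAG = re.compile(r"</[^>]+>")
# [quote, 1+ ((backslash, quote) or not quote), quote]
QUOTED_STRING = re.compile(r'"(?:\\"|[^"])+"')

html = '<a href="x">link</a>'
start_tags = HTML_START_TAG.findall(html)
end_tags = HTML_END_TAG.findall(html)

ssn = SSN.search("SSN: 123-45-6789").group()
quoted = QUOTED_STRING.search(r'say "a \"b\" c" now').group()

print(start_tags, end_tags, ssn, quoted)
```

The (backslash, quote) alternative is listed first so that an escaped quote is consumed as a pair instead of ending the string early.
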

data processing patterns (in TextPipe 6.8.2 and later)
[CSVfield] A Comma-Separated-Value field. If fields are delimited by single or double quotes, embedded newlines are allowed, as are doubled-up quotes. The quotes are returned as part of the match.
[TABfield] A Tab-delimited field. To process multiple tab fields, use a pattern such as:
  [ 3+ ( TABfield tab ) TABfield ]
date and time patterns
[Date] Matches a date format DD-MM-YY or DD-MMM-YY
[AMPM] The AM/PM part of a time
[Month] A MonthName or a MonthNumber
[MonthNumber] 1-12, with an optional leading zero
[MonthName] January-December and Jan-Dec
[Day] 1-31
[DayOfYear] 1..366
[Year] A 2 or 4 digit year (between 1800 and 2199)
[Hour] A 12 or 24-hour hour, with optional leading zero
[Minute] A 2 digit minute
[Second] A 2 digit second


Using the real world patterns above, you can easily construct the following EasyPatterns:
 

HMS [ Hour <:.-> Minute <:.-> Second ]
DMY [ Day <-/ > Month <-/ > Year ]
MDY [ Month <-/ > Day <-/ > Year ]
YMD [ Year <-/ > Month <-/ > Day ]
Julian [ Year DayOfYear ]
MY [ Month <-/ > Year ]
MD [ Month <-/ > Day ]
DM [ Day <-/ > Month ]
HM [ Hour <:. > Minute ]
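
The same building-block approach can be sketched in Python by composing small regex fragments. The HOUR, MINUTE and SECOND fragments below are assumptions based on the one-line descriptions above (12 or 24-hour hour with optional leading zero, 2 digit minute and second); the actual definitions inside EasyPattern may differ.

```python
import re

# Assumed regex equivalents of the keyword descriptions.
HOUR = r"(?:2[0-3]|[01]?\d)"
MINUTE = r"[0-5]\d"
SECOND = r"[0-5]\d"

# HMS [ Hour <:.-> Minute <:.-> Second ]
HMS = re.compile(f"{HOUR}[:.-]{MINUTE}[:.-]{SECOND}")
# HM [ Hour <:. > Minute ]
HM = re.compile(f"{HOUR}[:. ]{MINUTE}")

def is_hms(text: str) -> bool:
    return HMS.fullmatch(text) is not None

def is_hm(text: str) -> bool:
    return HM.fullmatch(text) is not None

print(is_hms("09:30:15"), is_hms("23.59.59"), is_hms("24:00:00"), is_hm("9 05"))
```
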

Complete Patterns: Putting it all Together

A complete pattern may include many individual keywords and many expressions. How do you know which keywords go together and where one expression stops and another begins? If in doubt, just enclose every expression in parentheses. However, EasyPattern has rules for combining keywords into expressions, so parentheses aren't always required. The traditional way of expressing these rules is to list the "precedence" of various operators or terms, from highest to lowest:

  • (...), including numbered groups
  • [or] for character sets and single-character literals
  • [not]
  • quantity specifiers
  • character set keywords (e.g. letter, digit) and single-character literals
  • multi-character literals
  • [or] as alternative, for groups and multi-character literals

Items with high precedence don't need parentheses; they group together automatically. For example, let's build a pattern step-by-step using the "high precedence" operators:
 [letter or digit] « "or" for character set keywords
 [letter or digit or '.'] « and single-character literal
 [letter or digit or '.' or <!?>] « and arbitrary set
 [not letter or digit or '.' or <!?>] « reverse the meaning with not
 [1+ not letter or digit or '.' or <!?>] « add a quantity specifier
 [1+ (not (letter or digit or '.' or <!?>))] « if you like parentheses, though the meaning is the same

Adding lower precedence terms before, after or both doesn't change the grouping, though the expression is long enough that you may find a pair of commas, brackets, or parentheses helpful. As long as you understand how EasyPattern is doing the grouping, it doesn't matter whether you choose commas, brackets or parentheses. If the parentheses are added around something that is already a group, they don't change the meaning.
 [punctuation 1+ not letter or digit or '.' or <!?> symbol]
 [punctuation, 1+ not letter or digit or '.' or <!?>, symbol] « same meaning but easier to read
 [punctuation][1+ not letter or digit or '.' or <!?>][symbol] « same meaning
 [punctuation (1+ not letter or digit or '.' or <!?>) symbol] « same meaning

Remember, commas and brackets don't change the meaning, only the look. If you put them in the middle of high precedence terms, you might confuse yourself:
 [punctuation 1+ not letter][or][digit or '.' or <!?> symbol] « same meaning but HARDER to read
 [punctuation 1+ not letter, or, digit or '.' or <!?> symbol] « same meaning but HARDER to read

Only parentheses change the meaning:
 [(punctuation 1+ not letter) or (digit or '.' or <!?> symbol)] « different meaning

Note that "or" for character sets and "or" as alternative have opposite precedence. See Character Sets and Alternatives (above) for details & examples.
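
The step-by-step build of the high precedence expression can be mirrored in Python by growing a single character class. This is a sketch that assumes ASCII-only letters and digits; it shows why the whole "or" chain ends up as one set that "not" and the quantity then apply to.

```python
import re

letter_or_digit = "A-Za-z0-9"             # [letter or digit]
with_literal = letter_or_digit + "."      # [letter or digit or '.']
with_custom_set = with_literal + "!?"     # [letter or digit or '.' or <!?>]
negated = f"[^{with_custom_set}]"         # [not letter or digit or '.' or <!?>]
repeated = negated + "+"                  # [1+ not letter or digit or '.' or <!?>]

runs = re.findall(repeated, "Hi! Wait... 42 -- ok?")
print(repeated, runs)
```

Each run in the result is a stretch of characters that belongs to none of the combined sets.
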

EasyPattern vs. perl regex or grep

At its core, EasyPattern uses "regular expression" technology that is similar to the "regex" or "grep" tools that originated on UNIX. EasyPattern's primary benefit is that the patterns are much easier to read and write.

For those who have some experience with regex, here are a few specific differences:

  • Quantity is specified as a prefix rather than a suffix. We believe prefix notation is much more natural.
     e.g. "[1+ digit]" rather than "[0-9]+"
  • Parentheses groups are not automatically numbered. Drawback (to some): you have to include a number if you want to refer to that matched portion. Benefits: the numbers don't change when you add other parentheses, you can number only the groups that you want to use (the parentheses that are there just for logical grouping don't get numbered).
  • No backslashes are required to "escape" special characters (instead, EasyPattern provides keywords such as rightBracket). Benefit: Other pattern languages already use backslash as an escape character so extra backslashes make patterns even more difficult to read.
  • EasyPattern includes keywords for many character sets that require a custom bracketed set in regex, e.g. punctuation, whitespace, paragraph, column, etc.
  • EasyPattern keywords generally include Macintosh-specific characters, e.g. [letter] includes letters with umlauts and other diacritical marks
  • EasyPattern can combine character sets with "or" (as well as use "or" for alternatives).
  • EP's [character] or [char] will match any character; the "equivalent" in some products will match anything except carriage return. If you want any character except a return (or formfeed), use [paragraphChar]; that is, any character that could appear in a paragraph. Of course, [not return] works too.
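
The last point can be seen in Python's re module, used here purely as one example of "some products". In Python the dot excludes linefeed (not carriage return) unless the DOTALL flag is set; the negated class standing in for [paragraphChar] is an assumption based on its definition as [not (formfeed OR return)].

```python
import re

text = "a\nb"
dot_default = re.findall(r".+", text)             # dot stops at the linefeed
dot_all = re.findall(r".+", text, re.DOTALL)      # closer to [1+ char]

# [1+ paragraphChar]: anything except formfeed (\x0c) or return (\r)
paragraphs = re.findall(r"[^\x0c\r]+", "one\rtwo\x0cthree\nfour")

print(dot_default, dot_all, paragraphs)
```
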

 

 

 © 1999-2005 Crystal Software. All rights reserved.